feat(prometheus): request/limit overlay, rightsizing, HPA + PVC + restart charts #753

Open: nadaverell wants to merge 2 commits into main from feature/inline-prometheus-insights

@nadaverell nadaverell commented May 20, 2026

Summary

Expands inline Prometheus surface on resource detail pages without trying to become a Grafana replacement. Five additions, all gated on graceful degradation — each is hidden silently when its data source isn't present (no nag, no "no data" panels).

1. Request/limit overlay on CPU + memory charts. Dashed reference lines on the existing area chart, summed across runtime containers (pure init excluded, native sidecars with restartPolicy: Always included) × readyReplicas for replicated workloads. The chart's Y axis auto-extends to fit the limit line so it doesn't clip when usage runs hot.

2. Rightsizing strip on Deployment / StatefulSet / DaemonSet full-screen Metrics tab. Backed by a new /prometheus/rightsizing/{kind}/{ns}/{name} endpoint that issues P95-over-24h subqueries per container. Tone policy is deliberately mild — most workloads are 2–3× over-provisioned and that's fine:

  • ≤3× ratio → Well-sized, no badge
  • 3–8× CPU / 3–5× mem → Nx headroom, neutral
  • 5×+ memory or 8×+ CPU → info "could reduce" (muted blue, not red)
  • No request set → warning "consider setting"
  • P95 > CPU limit → alert "throttling likely" (orange)
  • Memory P95 ≥ 95% of limit → critical "OOM risk" (red — the only red case)

3. HPA replicas chart. Bottom of HPA detail. Current/desired as two-line chart, min/max as reference lines. Sourced from kube_horizontalpodautoscaler_status_{current,desired}_replicas. Observed-CPU-vs-target chart deferred — KSM doesn't expose the observed value reliably across versions.

4. PVC usage gauge. Single-line capacity bar on the PVC renderer via kubelet_volume_stats_{used,capacity}_bytes. Traffic-light tone at 75%/90%. Hides silently when the CSI driver doesn't implement NodeGetVolumeStats or Prom isn't scraping kubelet (notably GMP default).

5. Restart event lane. Below the Metrics chart on workload + Pod detail. Vertical markers on a dedicated row rather than overlaying the chart waveform — clusters of restarts stay readable. Uses changes(kube_pod_container_status_restarts_total[1h]); hidden when KSM isn't reporting.
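
The rightsizing tone policy in item 2 can be sketched as a small classifier. This is a minimal illustration, not the PR's code: the `Tone` type, the `classifyTone` name, the convention that a zero request or limit means "not set", and the precedence order (limit checks first, then missing request, then ratio) are all assumptions.

```go
package main

// Tone is a hypothetical enum for the badge tone shown in the strip.
type Tone string

const (
	ToneNone     Tone = "none"     // well-sized, no badge
	ToneNeutral  Tone = "neutral"  // "Nx headroom"
	ToneInfo     Tone = "info"     // "could reduce"
	ToneWarning  Tone = "warning"  // "consider setting"
	ToneAlert    Tone = "alert"    // "throttling likely"
	ToneCritical Tone = "critical" // "OOM risk"
)

// classifyTone maps one container's request, limit and P95 usage to a tone.
// All three values share a unit (cores or bytes). A zero request or limit is
// treated as "not set" (assumption).
func classifyTone(isMemory bool, request, limit, p95 float64) Tone {
	if limit > 0 {
		if isMemory && p95 >= 0.95*limit {
			return ToneCritical
		}
		if !isMemory && p95 > limit {
			return ToneAlert
		}
	}
	if request <= 0 {
		return ToneWarning
	}
	if p95 <= 0 {
		// The ratio is undefined for an idle container; showing no badge
		// here is an assumption.
		return ToneNone
	}
	ratio := request / p95
	reduceAt := 8.0 // CPU threshold for "could reduce"
	if isMemory {
		reduceAt = 5.0
	}
	switch {
	case ratio >= reduceAt:
		return ToneInfo
	case ratio > 3:
		return ToneNeutral
	default:
		return ToneNone
	}
}
```

For example, a CPU request of 1 core against a P95 of 0.2 cores gives a ratio of about 5, which stays neutral for CPU but would read as "could reduce" for memory.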

Brittleness

Every new feature gracefully degrades:

  • KSM missing (Datadog Agent default, AMP without operator) → rightsizing + HPA + restart lane all hide
  • CSI driver lacks NodeGetVolumeStats (or k8s 1.34 volume-stats regression) → PVC gauge hides
  • GMP default config (kubelet not scraped) → PVC gauge hides
  • Prom not connected → existing PrometheusCharts empty state still surfaces "Discover Prometheus" CTA

Architecture notes

  • Adds HPARenderer and PVCRenderer to RendererOverrides so the host can wrap them with Prom-backed sections via extraSections without touching the base renderers' core layout.
  • MetricsTabContent composes RightsizingStrip + PrometheusCharts + RestartEventLane on the workload Metrics tab. RightsizingStrip is gated on expanded so it doesn't appear in drawer mode (wrong granularity for "what is this" view).
  • All new endpoints are gated on the existing Prom client connection. No new capability probe — graceful per-query degradation is the existing pattern in PrometheusCharts.

Test plan

  • go build ./..., go vet ./..., npm run tsc, make build all clean
  • go test ./internal/prometheus/... passes
  • Visual test on a Prom + KSM cluster to confirm rightsizing strip reads as neutral on typical workloads (this is the load-bearing UX check)
  • Visual test on a cluster without KSM to confirm restart lane / HPA charts / rightsizing all disappear silently
  • Visual test on EBS/GCE PD PVC to confirm the gauge populates; on a CSI driver without NodeGetVolumeStats to confirm it hides
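
For reference while running the PVC checks, the gauge's traffic-light behaviour at 75%/90% could look like the sketch below. The function name, the tone strings, and the choice to treat the thresholds as inclusive are assumptions; only the two thresholds and the hide-when-unavailable behaviour come from the description above.

```go
package main

// pvcTone returns the gauge tone for a volume and whether the gauge should be
// shown at all. The tone names are hypothetical.
func pvcTone(usedBytes, capacityBytes float64) (tone string, show bool) {
	if capacityBytes <= 0 || usedBytes < 0 {
		// No usable series (for example kubelet not scraped): hide the gauge.
		return "", false
	}
	frac := usedBytes / capacityBytes
	switch {
	case frac >= 0.90:
		return "critical", true
	case frac >= 0.75:
		return "warn", true
	default:
		return "ok", true
	}
}
```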

Note

Medium Risk
Adds new Prometheus-backed API endpoints and UI surfaces (rightsizing recommendations, PVC usage, restart/HPA charts) plus request/limit overlays, which could impact performance and correctness of metrics and involves RBAC gating on cached K8s specs.

Overview
Expands the Prometheus feature set with two new backend endpoints: /prometheus/pvc/{namespace}/{name} (PVC usage from kubelet_volume_stats_*) and /prometheus/rightsizing/{kind}/{namespace}/{name} (per-container CPU/memory P95-based request recommendations), both protected by a new request-scoped AuthGate wired from server.canRead.
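
A per-container P95-over-24h subquery of the kind the rightsizing endpoint issues might be built as follows. This is a sketch under assumptions: the helper name, the 5m rate window and subquery step, and selecting pods by a `name-.*` regex are illustrative choices, not taken from the PR. `container_cpu_usage_seconds_total` is the standard cAdvisor CPU counter.

```go
package main

import (
	"fmt"
	"regexp"
)

// p95CPUQuery builds a PromQL subquery that takes the 95th percentile of the
// 5m CPU rate over the last 24h for one container of a workload.
// podPrefix is assumed to be the workload name that pod names start with.
func p95CPUQuery(namespace, podPrefix, container string) string {
	selector := fmt.Sprintf(
		`namespace=%q,pod=~%q,container=%q`,
		namespace,
		regexp.QuoteMeta(podPrefix)+"-.*", // escape regex metacharacters in the name
		container,
	)
	return fmt.Sprintf(
		`quantile_over_time(0.95, rate(container_cpu_usage_seconds_total{%s}[5m])[24h:5m])`,
		selector,
	)
}
```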

Adds a new Prometheus metric category restarts (PromQL changes(kube_pod_container_status_restarts_total[1h])) and updates the web UI to surface restart event markers, an HPA replicas-over-time chart, and a PVC usage gauge, all designed to hide when Prometheus/KSM/kubelet series are unavailable.
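
Because `changes(...[1h])` is a rolling-window count, a single restart keeps the series positive for the whole hour. One way to turn that series into event markers is to emit a marker only when the count rises, or when the first sample is already nonzero. The sketch below assumes hypothetical `Sample` and `restartMarkers` names and does not say where in the stack the PR performs this step.

```go
package main

// Sample is one point of the rolling-window restart count series.
type Sample struct {
	Unix  int64   // sample timestamp in seconds
	Value float64 // value of changes(kube_pod_container_status_restarts_total[1h])
}

// restartMarkers returns the timestamps at which a marker should be drawn:
// the first sample if it is already nonzero, and every later sample whose
// value is higher than the one before it.
func restartMarkers(samples []Sample) []int64 {
	markers := []int64{}
	for i, s := range samples {
		if i == 0 {
			if s.Value > 0 {
				markers = append(markers, s.Unix)
			}
			continue
		}
		if s.Value > samples[i-1].Value {
			markers = append(markers, s.Unix)
		}
	}
	return markers
}
```

With this rule a series of 0, 1, 1, 1, 0 yields one marker rather than three.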

Enhances existing Prometheus CPU/memory charts with request/limit reference-line overlays computed from the resource’s pod spec (including native sidecars, excluding pure init containers), and adds a workload metrics header strip showing rightsizing recommendations for supported workload kinds.
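
The container selection for the overlay can be illustrated with trimmed stand-in types. `Container`, `PodSpec` and `sumCPURequestMilli` are hypothetical names used only for this sketch; the real code presumably works on the Kubernetes pod spec types.

```go
package main

// Container is a trimmed, hypothetical stand-in for a pod-spec container.
type Container struct {
	Name            string
	RestartPolicy   string // "Always" on an init container marks a native sidecar
	CPURequestMilli int64
}

// PodSpec is a trimmed, hypothetical stand-in for a pod spec.
type PodSpec struct {
	InitContainers []Container
	Containers     []Container
}

// sumCPURequestMilli adds up CPU requests for containers that run for the
// pod's lifetime: regular containers plus native sidecars. Pure init
// containers are skipped because they exit before the workload runs.
func sumCPURequestMilli(spec PodSpec) int64 {
	var total int64
	for _, c := range spec.Containers {
		total += c.CPURequestMilli
	}
	for _, c := range spec.InitContainers {
		if c.RestartPolicy == "Always" {
			total += c.CPURequestMilli
		}
	}
	return total
}
```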

Reviewed by Cursor Bugbot for commit c80200f. Bugbot is set up for automated code reviews on this repo.

…tart charts

Expands inline Prometheus surface on resource detail pages without trying to
become a Grafana replacement. Five additions, each hidden silently when its
data source isn't present (no nag, no "no data" messages).

* Request/limit dashed reference lines on existing CPU + memory area charts,
  summed across runtime containers (excluding pure init, including native
  sidecars) × readyReplicas for replicated workloads.
* Rightsizing strip on Deployment/StatefulSet/DaemonSet full-screen Metrics
  tab. P95 over 24h with KRR-style headroom (15% CPU, 10% memory). Tone
  policy is mild: 2-3x headroom reads as "well-sized", only >5x mem or >8x
  CPU surfaces as info "could reduce". Red reserved for actual OOM risk
  (memory P95 >= 95% of limit); orange only for confirmed CPU throttling.
* HPA detail page gets a replicas chart (current/desired lines + min/max
  reference lines) via KSM.
* PVC renderer gets a single-line usage gauge via kubelet_volume_stats_*.
  Hidden silently when the CSI driver doesn't report or Prom isn't scraping
  kubelet (notably GMP default).
* Restart event lane below the Metrics chart — vertical markers on a
  dedicated row rather than overlaying the waveform, so clusters of restarts
  stay readable.
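
The headroom arithmetic in the rightsizing bullet can be sketched as below. The 15% and 10% headroom figures come from this commit message and the 10m/16Mi steps from the later test description; rounding up (rather than to nearest) and never recommending less than one step are assumptions, as are the function names.

```go
package main

import "math"

const mebibyte = 1024 * 1024

// roundUpToStep rounds v up to a multiple of step, with a floor of one step.
func roundUpToStep(v float64, step int64) int64 {
	n := int64(math.Ceil(v / float64(step)))
	if n < 1 {
		n = 1
	}
	return n * step
}

// recommendCPUMilli takes P95 CPU usage in cores and returns a request in
// millicores: P95 plus 15% headroom, rounded up to a multiple of 10m.
func recommendCPUMilli(p95Cores float64) int64 {
	return roundUpToStep(p95Cores*1000*1.15, 10)
}

// recommendMemoryBytes takes P95 memory usage in bytes and returns a request
// in bytes: P95 plus 10% headroom, rounded up to a multiple of 16Mi.
func recommendMemoryBytes(p95Bytes float64) int64 {
	return roundUpToStep(p95Bytes*1.10, 16*mebibyte)
}
```

For a P95 of 0.1 cores this gives 115m before rounding and 120m after; for 100Mi of memory it gives 110Mi before rounding and 112Mi after.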

Brittleness mitigations: every new feature gates on a query that returns no
series when its dependency (KSM, kubelet scrape, CSI NodeGetVolumeStats) is
missing, and hides rather than rendering an error or "not configured" panel.
PrometheusCharts' existing empty state still surfaces the "Discover
Prometheus" CTA when nothing is connected.

Adds `HPARenderer` and `PVCRenderer` to RendererOverrides so the host can
wrap them with platform data hooks without modifying the base renderers'
core layout.
@nadaverell nadaverell requested a review from hisco as a code owner May 20, 2026 14:53

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 802c80d.

}
}

ratio := reqVal / p95

Division by zero when P95 usage is zero

Medium Severity

When p95 is exactly zero (e.g., an idle container with no CPU usage over 24h), ratio := reqVal / p95 produces +Inf. This flows into fmt.Sprintf("Over-provisioned by %.1fx — could reduce", ratio) which renders as "Over-provisioned by +Infx — could reduce" in the user-facing rightsizing strip. A guard for p95 <= 0 (or near-zero) before computing the ratio would prevent this broken display.
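
A guard along the lines suggested here could look like the following sketch. The `ratioLabel` name and the decision to return "no label" rather than a fallback string are assumptions; the point is only that the ratio is never formatted when it is not a finite number.

```go
package main

import (
	"fmt"
	"math"
)

// ratioLabel formats the request-to-P95 ratio as "N.Nx". It returns ok=false
// when the ratio cannot be computed meaningfully, so the caller can skip the
// over-provisioned message instead of printing "+Infx".
func ratioLabel(reqVal, p95 float64) (label string, ok bool) {
	if p95 <= 0 || math.IsNaN(p95) || math.IsNaN(reqVal) {
		return "", false
	}
	ratio := reqVal / p95
	if math.IsInf(ratio, 0) {
		return "", false
	}
	return fmt.Sprintf("%.1fx", ratio), true
}
```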

* Reference line scale: PromQL is per-pod, drop the readyReplicas multiplier
  so the line lands on the same axis as the chart's per-pod series.
* RightsizingResponse.Rows: initialize as []RightsizingRow{} in the no-
  containers branch so the wire format matches the TS contract (non-null).
* loadWorkloadContainers: return sentinel errors so the handler can map
  cache-not-ready to 503 and RBAC-denied (nil lister) to 403 instead of
  reporting both as 404.
* PVC usage handler: log Prom query failures via errorlog so operators
  can distinguish "Prom unhealthy" from "CSI doesn't report".
* RBAC gate: new SetAuthGate hook in the prometheus package; server wires
  it to canRead. The new rightsizing + PVC endpoints now require the
  caller to be able to "get" the underlying resource before reading from
  the SA-populated informer cache.
* Restart event lane: emit a marker only when the rolling-window count
  increases (or first-sample-nonzero). The previous "every positive
  sample" rule turned one restart into ~60 markers + a ~60× total.
* readQuantity: NaN guards on the suffix paths so malformed YAML can't
  poison the request/limit sum.
* HPACharts + RestartEventLane: consume the React Query error field so
  Prom-side failures get a console.warn breadcrumb instead of silently
  looking identical to no-data.
* PVCUsageBar: paired light/dark text tones (text-red-700 / text-red-400)
  so the percentage stays legible in both themes.
* Comment drift: "Pods are too granular" reworded to match the gate;
  PVC label-fallback comment trimmed to match what the code actually does.
* Add rightsizing_test.go: tone-classifier boundaries (3x/5x/8x ratios,
  0.95 mem-OOM threshold), recommendRequest 10m/16Mi rounding, native-
  sidecar inclusion, formatRightsizingValue edge cases.